Abstract


For  a  wide  variety  of  applications,  both  task  and  data  parallelism  must  be  exploited  to  achieve  the  best  possible 
performance  on  a  multicomputer.  Recent  research  has  underlined  the  importance  of  exploiting  task  and  data  parallelism 
in a single compiler framework, and such a compiler can map a single source program in many different ways
onto a parallel machine. There are several complex tradeoffs between task and data parallelism, depending on the
characteristics of the program to be executed and the performance parameters of the target parallel machine. This
makes it very difficult for a programmer to select a good mapping for a task and data parallel program. In this paper
we isolate and examine specific characteristics of executing programs that determine the performance for different
mappings on a parallel machine, and present an automatic system to obtain good mappings. The process consists of
two steps: First, an instrumented input program is executed a fixed number of times with different mappings, to build
an  execution  model  of  the  program.  Next,  the  model  is  analyzed  to  obtain  a  good  final  mapping  of  the  program  onto 
the  processes  of  the  parallel  machine.  The  current  implementation  is  static,  feedback  driven,  although  the  approach 
can  be  extended  to  a  dynamic  system.  We  demonstrate  the  system  with  an  example  program  that  is  a  model  for  many 
applications  in  the  domains  of  signal  processing  and  image  processing. 

1  Introduction

Many applications can be naturally expressed as collections of coarse grain tasks, possibly with data parallelism inside
them. The Fx compiler at Carnegie Mellon is designed to allow the programmer to express and exploit task and
data  parallelism,  and  we  have  demonstrated  the  power  and  value  of  this  approach  [12].  In  this  paper,  we  address  the 
problem  of  mapping  a  task  and  data  parallel  program  to  achieve  the  best  performance. 

There  are  several  reasons  to  support  task  parallelism,  in  addition  to  data  parallelism,  in  a  parallelizing  compiler. 
For  many  applications,  particularly  in  the  areas  of  digital  signal  processing  and  image  processing,  the  problem  sizes 
are  relatively  small  and  fixed,  and  not  enough  data  parallelism  is  available  to  effectively  use  the  large  number  of 
processors  of  a  massively  parallel  machine.  Even  when  adequate  data  parallelism  is  available,  a  data  parallel  mapping 
may  not  be  the  most  efficient  way  to  execute  a  program.  A  purely  data  parallel  approach  forces  all  computations  to 
be  performed  on  a  fixed  set  of  processors,  whose  number  typically  varies  from  tens  to  thousands.  However,  it  is  the 
amount and nature of parallelism and communication that determine how well a computation scales as the number of
processors is increased. A complete application often consists of a series of computations with different computation
and communication requirements. For best performance, it may be necessary to assign different sets of processors to
different  computations,  and  execute  several  different  computations  simultaneously. 

Using  task  and  data  parallelism  together,  it  is  possible  to  map  an  application  in  a  variety  of  ways  onto  a  parallel 
machine.  Consider  an  application  with  three  coarse  grain,  pipelined,  data  parallel  computation  stages,  with  each 
execution  of  a  computation  stage  corresponding  to  a  task.  A  set  of  mappings  for  such  an  application  is  shown  in 
Figure  1.  Figure  1(a)  shows  a  pure  data  parallel  mapping,  where  all  processors  participate  in  all  computation  stages. 
Figure  1(b)  shows  a  pure  task  parallel  mapping,  where  a  subset  of  processors  is  dedicated  to  each  computation  stage. 
It  may  be  possible  to  have  multiple  copies  of  the  data  parallel  mapping  executing  on  different  sets  of  processors,  as 
shown in Figure 1(c). Finally, a mix of task and data parallelism with replication is shown in Figure 1(d).


[Figure panels: (a) data parallel mapping; (b) task parallel mapping; (c) replicated data parallel mapping; (d) combination of replicated data and task parallel mapping]

Figure 1: Combinations of data and task parallel mappings


The fundamental question that we are addressing in this paper is: "Which of the many mappings that are feasible
when data and task parallelism are used together is optimal, and how can this be determined automatically?"

A  key  determinant  of  performance  is  communication  locality  within  each  data  parallel  stage,  but  it  has  to  be  traded 
off  against  processor  usage,  memory  requirements,  and  inter-task  communication  cost.  Communication  locality  can  be 
improved  by  executing  each  task  on  a  small  number  of  processors,  but  this  makes  it  difficult  to  load  balance  and  keep  all 
processors  busy,  implies  a  higher  memory  requirement  per  processor,  and  possibly  a  higher  inter-task  communication 
cost,  due  to  a  finer  task  granularity. 

Our  solution  is  to  build  an  execution  model  of  the  application  using  timing  information  from  trial  executions,  and 
analyze  it  to  predict  the  most  efficient  mapping.  Our  current  implementation  is  static  and  feedback  driven,  and  is 
applicable  to  programs  for  which  sample  data  sets  can  capture  the  general  execution  behavior.  In  a  dynamic  system, 
the  scope  is  extended  to  other  programs  where  recent  execution  history  is  a  good  predictor  of  future  execution  behavior. 


2  Overview 


2.1  Programming  and  compiling  task  parallelism 

The Fx compiler supports task and data parallelism [12]. The base language is Fortran 77 augmented with Fortran 90
array syntax, and data layout statements based on Fortran D and High Performance Fortran. Data parallelism is
expressed using array syntax and parallel loops. The main concepts in the Fx compiler are language independent, and
Fortran was chosen for convenience and user acceptance. The current target machine is an iWarp processor array [2].

Task  parallelism  is  expressed  in  special  code  regions  called  parallel  sections.  The  body  of  a  parallel  section  is 
limited  to  calls  to  subroutines  called  task-subroutines,  with  each  execution  instance  representing  a  parallel  task,  and 
loops  to  represent  multiple  instances.  Each  task-subroutine  call  is  followed  by  input  and  output  directives,  which 
define  the  interface  of  the  task  subroutine  to  the  calling  routine,  that  is,  they  list  the  variables  in  the  calling  routine  that 
are  accessed  and  modified  in  the  called  task-subroutine.  The  entries  in  the  input  and  output  lists  can  be  scalars,  array 
slices,  or  whole  arrays.  The  task-subroutines  may  have  data  parallelism  inside  them.  A  parallel  section  of  an  example 
program  is  shown  in  Figure  2. 


c$    begin parallel
      do i = 1,10
         call src(A,B)
c$       output: A,B
         call p1(A)
c$       input: A
c$       output: A
         call p2(B)
c$       input: B
c$       output: B
         call sink(A,B)
c$       input: A,B
      enddo
c$    end parallel

[Figure panels: task dependence graph and partitioning into modules M1 and M2; replication and machine mapping]

Figure 2: Compilation of task parallelism

The compiler can map and schedule instances of task-subroutines in any way, as long as sequential execution results
are guaranteed¹. During compilation, first the input and output directives are analyzed and a task level data dependence
and communication graph is built. Next, the task-subroutines are grouped into modules. All task-subroutines in the
same module are mapped to the same set of processors. The modules may be replicated to generate multiple module
instances, with each instance receiving and sending data sets in round robin fashion. Finally, a subset of the processor
array is assigned to each module instance.

¹ Assuming that input and output parameters are specified correctly.

Figure 2 shows the application of these steps on a small example program. The task level dependence graph is
built from the program and then partitioned into modules M1 and M2. Module M2 is replicated to two instances, and
the program is mapped onto the machine.

2.2  Automatic Mapping


[Figure labels: Source Program, Parallel Machine]

Figure 3: Automatic Mapping System

The  subject  of  this  paper  is  automation  of  the  mapping  process  discussed  in  the  previous  subsection,  which  is  otherwise 
driven  by  user  assertions.  The  structure  of  the  automatic  mapping  tool  and  process  is  shown  in  Figure  3.  The 
programmer  provides  a  source  program  with  data  parallel  and  task  parallel  constructs,  as  well  as  a  sample  data  set. 
The  mapping  tool  and  the  parallelizing  compiler  generate  multiple  instrumented  versions  of  the  program,  which  are 
then  executed  on  the  parallel  machine,  generating  runtime  statistics.  The  mapping  tool  uses  these  statistics  to  build  an 
internal  execution  model  of  the  program,  and  analyzes  it  to  generate  the  final  mapping  information.  This  information 
is  provided  to  the  parallelizing  compiler  to  generate  a  final  parallel  program  for  execution. 

3  Execution  model  for  task  and  data  parallel  programs 

An Fx parallel program consists of a set of data parallel task-subroutines that are related by data dependences between
them. Every active task-subroutine is in one of the following states:

1.  Waiting for a sender to be ready to send input.

2.  Receiving input.

3.  Executing.

4.  Waiting for a receiver to be ready to receive output.

5.  Sending output.

By  measuring  the  time  it  spends  in  each  of  these  states,  for  different  mappings,  we  can  determine  the  fundamental 
execution  properties  of  a  task-subroutine,  which  can  in  turn  be  used  to  predict  the  execution  behavior  for  other 
mappings. In this section, we present an execution model for task-subroutines, which consists of parameterized
equations for execution time, memory requirement, and inter-task communication time.

3.1  Execution time

The  execution  time  of  a  task-subroutine  on  a  set  of  processors  is  mainly  determined  by  the  total  amount  of  computation, 
the  amount  of  available  parallelism,  and  the  communication  cost  incurred  in  exploiting  the  parallelism.  Thus,  the 
execution  time  of  task-subroutine  T  executing  on  P  processors  can  be  expressed  as: 

$$\tau_T(P) = C_0 + (C_1 / P) + (C_2 \cdot P)$$

The values of the $C_i$ parameters describe the execution time characteristics of a specific task-subroutine. The constant
term $C_0$ reflects the time spent on sequential computation in the program, as well as the fixed part of the communication
cost. The second term represents the time spent in parallel computation, which decreases in inverse proportion to the
number of processors. The last term represents the part of the communication cost that varies with the number of processors, as
the parallelism in communication increases, but the granularity of communication becomes finer.
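The shape of this model can be explored with a few lines of code. The following is a minimal sketch, not part of the Fx toolset: the function names and the parameter values used in the usage note are hypothetical, chosen only to show how the $1/P$ and $P$ terms pull in opposite directions.

```python
def execution_time(c0, c1, c2, p):
    """Evaluate tau(P) = C0 + C1/P + C2*P for a task-subroutine on p processors."""
    if p < 1:
        raise ValueError("processor count must be at least 1")
    return c0 + c1 / p + c2 * p


def best_processor_count(c0, c1, c2, max_p):
    """Return the processor count in 1..max_p with the smallest predicted time.

    Plain enumeration is used because processor counts are small integers.
    """
    return min(range(1, max_p + 1),
               key=lambda p: execution_time(c0, c1, c2, p))
```

With the illustrative values `c0=1.0`, `c1=64.0`, `c2=0.25`, the predicted time on 16 processors is `1 + 4 + 4 = 9`, and adding processors beyond that point makes the task slower, which is the behavior the last term is meant to capture.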

3.2  Inter-task communication time

The  transfer  of  data  between  a  pair  of  task-subroutines  is  of  two  different  forms.  If  the  two  task-subroutines  are  mapped 
on  the  same  set  of  processors,  the  data  transfer  is  a  potential  local  redistribution  of  data.  If  the  task-subroutines  are 
mapped on different sets of processors, there is movement of data from one processor subarray to another processor
subarray. In both cases, the cost is dependent on the volume of the data to be transferred, and the number of processors
involved.  Increasing  the  number  of  processors  implies  a  potentially  higher  degree  of  parallelism  in  communication, 
but  may  also  increase  the  associated  overhead. 

We model the data transfer cost between task-subroutines executing on the same set of $P$ processors (data parallel
style) as:

$$\tau_{dp}(P) = C_0^{dp} + (C_1^{dp} \cdot P)$$

and for a pair of task-subroutines executing on $P_1$ and $P_2$ processors respectively (task parallel style) as:

$$\tau_{tp}(P_1, P_2) = C_0^{tp} + (C_1^{tp} \cdot P_1) + (C_2^{tp} \cdot P_2)$$


3.3  Memory requirement

The  memory  requirement  of  a  task-subroutine  is  an  important  parameter.  In  a  multicomputer  that  has  nodes  with  a 
fixed  amount  of  uniform  storage,  the  memory  requirement  determines  the  set  of  feasible  mappings.  In  the  presence  of 
a  memory  hierarchy,  the  memory  requirement  plays  an  important  role  in  determining  overall  performance. 

The  memory  requirement  is  closely  related  to  the  way  a  parallelizing  compiler  allocates  and  manages  memory. 
We outline the main memory requirements of a parallel program, and state how they are managed by our compilation
system:

1.  Global  and  system  variables:  Allocated  for  the  duration  of  execution. 

2.  Local  variables:  Allocated  on  the  stack  as  execution  proceeds. 

3.  Compiler buffers and variables: Allocated dynamically on the heap as execution proceeds.

Global  and  system  variables  are  allocated  identically  for  all  modules  and  processors.  Local  variables  are  allocated 
at run time on the stack as the task-subroutines are executed. Sufficient memory for the communication and other
buffers is allocated on entry to a subroutine, and deallocated on exit. If an array argument is to be redistributed as a
result of a subroutine call, a compiler variable related to the size of the argument is allocated for the duration of the
execution  of  the  subroutine.  If  the  argument  is  to  be  transferred  from  a  task-subroutine  that  belongs  to  another  module, 
such  a  compiler  variable  is  not  needed,  since  the  argument  is  placed  in  the  appropriate  distribution  on  transfer  between 
modules. 

Consider the memory requirement of a processor that is executing a module $M$ with task-subroutines $T_1, T_2, \ldots, T_n$.
If the local memory requirement of task-subroutine $T_i$ is $\mu_{T_i}$ and the global memory requirement for the program is
$\mu_{gm}$, then the total memory requirement $\mu_{total}$ for module $M$ is:

$$\mu_{total} = \mu_{gm} + \max_{1 \le i \le n} \mu_{T_i}$$

The max is taken since local memory for subroutines is allocated on the stack. If a module is executing on a set of $P$ processors,
the memory required per processor to hold global variables is:

$$\mu_{gm}(P) = C_0^{gm} + (C_1^{gm} / P)$$

where $C_0^{gm}$ and $C_1^{gm}/P$ represent the memory requirement due to replicated and distributed variables respectively.

The memory requirement of individual task-subroutines is dependent on arguments that are redistributed on
subroutine entry. If the task-subroutine that sends the argument is part of the same module, then additional memory
is allocated for redistribution. If a task-subroutine has $k$ arguments that are potentially redistributed on entry, then the
per processor memory requirement for holding local variables and parameters can be modeled as:

$$\mu_T(P) = C_0^{l} + (C_1^{l} / P) + \sum_{i=1}^{k} (b_i \cdot s_i)$$

where $C_0^{l}$ and $C_1^{l}/P$ represent the memory requirements of local replicated and distributed variables respectively,
$s_i$ is the size of the $i$th redistributable argument and $b_i$ is 1 if the task-subroutine from which the argument is sent is in the
same module as $T$, and 0 otherwise.


4  Deriving  the  execution  model  parameters 

We  restate  the  parameterized  equations  for  task  execution  time  and  inter-task  communication  time,  from  the  previous 
section: 


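$$\tau_T(P) = C_0 + (C_1 / P) + (C_2 \cdot P) \qquad (1)$$

$$\tau_{dp}(P) = C_0^{dp} + (C_1^{dp} \cdot P) \qquad (2)$$

$$\tau_{tp}(P_1, P_2) = C_0^{tp} + (C_1^{tp} \cdot P_1) + (C_2^{tp} \cdot P_2) \qquad (3)$$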

We have to determine the actual $C_i$, $C_i^{dp}$, and $C_i^{tp}$ parameters for each task-subroutine. Consider equation (1). By
executing a task-subroutine three times on different numbers of processors, and measuring the execution time, we obtain
three equations in three unknowns, which can be solved in a straightforward way to obtain all $C_i$ values. Similarly,
by measuring the data transfer time for two different executions for equation (2), and three different executions for
equation (3), the unknown parameters can be determined by solving systems of linear equations.
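The fitting step for equation (1) can be sketched as follows. This is an illustration under stated assumptions rather than the tool's actual code: the function names are hypothetical, Gauss-Jordan elimination is used as one straightforward solver, and the timings in the usage note are made-up sample values.

```python
def solve_linear(a, b):
    """Solve a small dense system a*x = b by Gauss-Jordan elimination
    with partial pivoting. a is a list of rows, b the right-hand side."""
    n = len(b)
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("measurements do not determine the parameters")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_execution_model(samples):
    """samples: three (processors, measured_time) pairs with distinct
    processor counts. Returns [C0, C1, C2] for tau(P) = C0 + C1/P + C2*P."""
    rows = [[1.0, 1.0 / p, float(p)] for p, _ in samples]
    times = [t for _, t in samples]
    return solve_linear(rows, times)
```

For example, hypothetical timings of 18, 11 and 9 time units on 4, 8 and 16 processors give `C0 = 1`, `C1 = 64` and `C2 = 0.25`. The same helper applies to equations (2) and (3) once the rows are built from `[1, P]` and `[1, P1, P2]` respectively.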

We  omit  the  details  of  the  derivation  of  memory  requirement  parameters,  and  simply  state  that  they  can  be  inferred 
using  compile  time  information  and  measurements  of  stack  and  heap  allocations,  made  during  execution  with  two 
different  mappings. 

Lemma 1  All the parameters of the execution model described in Section 3 can be obtained by executing a program
at most 5 times.

Proof: The parameters for data parallel and task parallel communication can be obtained (for instance) by executing
the program in pure data parallel mode twice and in pure task parallel mode three times, that is 5 distinct executions.
The parameters for execution time require three different executions and those for memory requirement require two;
both can be included in the set of 5 executions stated above.

We  wish  to  point  out  that  the  final  parameterized  equations  only  approximate  the  real  behavior  of  the  program.  In 
practice,  however,  they  provide  sufficient  information  for  the  mapping  process. 


5  Building  modules  from  tasks 

Once the parameters of the execution model for a program are established, it is possible to predict the total execution
time for any program mapping. The objective is to find the optimal mapping for a given program on a given set of
processors. To establish the mapping of a program onto a set of processors, the following decisions have to be made:

1.  Partitioning of task-subroutines into modules.

2.  Possible replication of modules.

3.  Allocation of processors to module instances.

In  this  section,  we  address  the  problem  of  partitioning  task-subroutines  into  modules.  We  initially  place  every 
task-subroutine in a module by itself, and then selectively merge pairs of modules, whenever it is considered profitable.
We  restate  that  all  task-subroutines  in  the  same  module  are  mapped  to  the  same  set  of  processors. 

One  reason  why  the  mapping  problem  is  difficult  is  that  the  three  subproblems  listed  above  are  interdependent, 
that  is,  the  optimal  solution  for  each  depends  on  the  solutions  of  others.  We  present  a  set  of  results  that  address  the 
problem  of  partitioning  tasks  into  modules,  with  some  assurance  that  the  final  mapping  obtained  from  this  partitioning 
is  close  to  optimal.  In  the  next  section,  we  address  replication  of  modules  and  processor  allocation. 

5.1  Theoretical  complexity 

It  is  fairly  easy  to  show  that  the  problem  of  optimally  mapping  a  set  of  parallel  tasks  to  a  set  of  processors  is  in  a  class 
of  multiprocessor  scheduling  problems  that  are  NP-hard.  For  this  purpose,  we  consider  a  parallel  machine  with  only 
two processors, and assume that there are no memory or dependence constraints. The restricted optimization problem
obtained  can  be  stated  as  follows: 

Given a set of tasks $T_1, T_2, \ldots, T_n$, with execution time $\tau(i)$ for task $T_i$, divide the tasks into two groups such that
the larger of the two groups' total execution times is minimized.

This problem can be shown to be NP-hard by transforming the Partition problem [7] to the above problem.
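To make the restricted problem concrete, the sketch below solves it by exhaustive enumeration. The function name and the sample task times are hypothetical; the point is only that the search space has $2^n$ assignments, which is why exact solution is practical only for small task sets.

```python
from itertools import product


def best_two_way_split(times):
    """Try every assignment of tasks to two groups and return
    (cost, group_a, group_b), where cost is the larger group total.

    The enumeration visits 2**len(times) assignments.
    """
    best = None
    for mask in product((0, 1), repeat=len(times)):
        groups = ([], [])
        for t, g in zip(times, mask):
            groups[g].append(t)
        cost = max(sum(groups[0]), sum(groups[1]))
        if best is None or cost < best[0]:
            best = (cost, groups[0], groups[1])
    return best
```

For task times `[3, 1, 4, 2, 2]` the total is 12, and a perfectly balanced split with cost 6 exists, for example `{4, 2}` and `{3, 1, 2}`.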

5.2  Pair  of  tasks 

We  first  consider  a  pair  of  task-subroutines  that  are  assigned  to  different  modules,  and  examine  the  performance 
implications of merging the modules. There are several reasons why it may be profitable to assign a pair of
task-subroutines to different modules:

•  More  processors  may  be  used  effectively  by  two  different  modules  than  one. 

•  Replication  and  processor  assignment  decisions  for  the  task-subroutines  are  decoupled,  allowing  more  flexibility. 

•  Individual  memory  requirements  for  each  module  may  be  lower  than  the  memory  requirement  of  a  single  merged 
module,  allowing  each  instance  to  fit  in  a  smaller  number  of  processors  and  improving  communication  locality. 

However,  the  cost  of  transferring  data  between  task-subroutines  can  increase  when  they  belong  to  different  modules, 
and  are  mapped  on  different  parts  of  the  processor  array.  In  particular,  when  the  distribution  of  an  array  item  is  the 
same  for  the  two  task-subroutines,  the  data  transfer  cost  is  zero  when  they  belong  to  the  same  module,  but  can  be 
significant  when  they  are  in  different  modules. 

It  is  possible  to  make  the  decision  on  merging  the  modules  in  a  provably  correct  way,  if  the  following  condition 
holds  for  the  corresponding  task-subroutines: 

1.  The task-subroutine scales linearly, including inter-task communication cost, or

2.  The  task-subroutine  can  be  replicated,  and  the  total  number  of  processors  available  is  much  greater  than  the 
minimum  number  needed  to  map  the  subroutine. 

This condition allows us to assign a fixed execution speed (per processor) to task-subroutines. If a task-subroutine
scales perfectly, then this is trivial. Otherwise, we pick the fastest feasible execution rate, which corresponds to
execution with the smallest number of processors that the task-subroutine can execute on, since that provides the best
communication locality. Under this condition, the following lemma holds:

Lemma 2  Consider a program consisting of two task-subroutines $T_1$ and $T_2$, which can be mapped to individual
modules $M_1$ and $M_2$, or a single module $M_{12}$. Let the minimum number of processors required to fit the data sets of
corresponding modules be $P_1$, $P_2$ and $P_{12}$. The criterion for placing both the task-subroutines in the same module is:

where $\tau_1$, $\tau_2$ and $\tau_{12}$ are the execution times (including communication) of the corresponding modules, on $P_1$, $P_2$ and
$P_{12}$ processors, respectively.

The result is obtained by comparing the predicted speeds of execution for the two cases, and is our basis for dividing
task-subroutines into modules. The condition is verified in a quantitative sense, and in our experience, usually holds.
An iterative approach is used when the lemma cannot be used directly.
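The comparison of predicted speeds can be sketched as follows. This is one reading under the fixed per processor rate assumption, and not necessarily the exact form used by the tool; $P_{total}$ (the number of available processors) and $R$ (throughput in data sets per second) are symbols introduced here for illustration.

```latex
% Separate modules: an instance of M_i processes one data set in tau_i seconds
% on P_i processors, so sustaining R data sets per second costs R * tau_i * P_i
% processors. Balancing both modules at the same throughput gives
R_{sep} \, (\tau_1 P_1 + \tau_2 P_2) = P_{total}
\quad\Rightarrow\quad
R_{sep} = \frac{P_{total}}{\tau_1 P_1 + \tau_2 P_2}

% Merged module: all processors run instances of M_12
R_{merged} = \frac{P_{total}}{\tau_{12} P_{12}}

% Merging is predicted to be at least as fast when
\tau_{12} \, P_{12} \;\le\; \tau_1 P_1 + \tau_2 P_2
```

Under this reading, merging pays off when the processor-time consumed per data set by the merged module does not exceed the combined processor-time of the two separate modules.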

5.3  Task graphs

We now address the problem of partitioning a task graph, which is required to be acyclic, except for self dependences.
The result from the last section can be used to make pairwise decisions, but the order in which the pairs are chosen can
influence the final partitioning. Consider the task graph in Figure 4. Task-subroutines $T_1$, $T_2$ and $T_3$ require a minimum
of 6, 12 and 6 processors, respectively, to execute. For simplicity, we assume that the minimum number of processors
required to execute a module is the maximum of the processors required for individual subroutines inside the module.
Also, suppose $T_3$ is a communication intensive task-subroutine that does not scale well, while the others scale linearly,
and the volume of communication along the edge $T_1 \to T_3$ is the only major inter-task communication cost, which is
non-existent if those two task-subroutines are mapped to the same module.

[Figure content: task graph over T1, T2 and T3, with arrow thickness reflecting volume of communication. Minimum processors: P1 = 6, P2 = 12, P3 = 6. Optimal partition: {(T1, T3), (T2)}. Other partitions: {(T1, T2), (T3)}, {(T1, T2, T3)}.]

Figure 4: Partitioning a task graph into modules

The optimal partitioning into modules is $\{(T_1, T_3), (T_2)\}$, which is obtained if we examine $T_1$ and $T_3$ first, and merge
them into a module. However, if we examine $T_1$ and $T_2$ first, we may merge them, and then the possible partitionings
are $\{(T_1, T_2), (T_3)\}$, which implies that the cost of communication from $T_1 \to T_3$ has to be paid, or $\{(T_1, T_2, T_3)\}$, which
implies that $T_3$, which does not scale well, will be forced to use at least 12 processors per instance instead of 6, which is
wasteful.
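The processor requirements of the three partitionings in this example can be checked mechanically. The sketch below uses only the simplifying rule stated above (a module needs the maximum of its members' minimum processor counts); the helper names are hypothetical and no communication or scaling costs are modeled.

```python
# Minimum processor counts from the Figure 4 example.
MIN_PROCS = {"T1": 6, "T2": 12, "T3": 6}


def module_min_procs(module):
    """A module needs as many processors as its most demanding member."""
    return max(MIN_PROCS[t] for t in module)


def forced_growth(partition):
    """Tasks that must run on more processors than their own minimum,
    mapped to (own minimum, module minimum)."""
    grown = {}
    for module in partition:
        need = module_min_procs(module)
        for task in module:
            if MIN_PROCS[task] < need:
                grown[task] = (MIN_PROCS[task], need)
    return grown


PARTITIONS = {
    "T1+T3 | T2": [("T1", "T3"), ("T2",)],
    "T1+T2 | T3": [("T1", "T2"), ("T3",)],
    "T1+T2+T3": [("T1", "T2", "T3")],
}

for name, partition in PARTITIONS.items():
    print(name, [module_min_procs(m) for m in partition], forced_growth(partition))
```

Only the single-module partitioning forces the poorly scaling $T_3$ from 6 up to 12 processors per instance, while the first partitioning forces no task above its own minimum.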

In  general,  the  problem  is  that  we  may  combine  a  set  of  compute  intensive  routines  in  one  module  first  (with  limited 
benefit),  which  may  leave  a  communication  intensive  routine  with  limited  choices.  Based  on  these  observations,  we 
use  the  following  heuristic  to  pick  a  pair  of  modules  for  a  potential  merger: 

•  Higher priority to modules that do not scale well as the number of processors is increased.

•  Higher priority to pairs of modules that need roughly the same number of processors, since the decision influences
other decisions less.

The pairwise selection method discussed here is effective for task graphs that are trees, and our toolset uses it. We
are motivated by the fact that a large class of applications lead to task graphs that are trees, often just straight line
graphs. There is some additional complexity involved in the partitioning of acyclic task graphs that are not trees, and
we expect to address that in future publications.

An obvious alternate approach is to exhaustively try all orders and use the best results obtained. Although this
approach  does  not  seem  appealing,  it  can  be  used  effectively  in  many  situations,  since  task  graphs  in  a  data  and  task 
parallel  program  often  contain  only  a  few  nodes. 

6  Mapping modules to processors

Once  the  task-subroutines  have  been  partitioned  into  modules,  we  divide  the  processors  among  the  modules,  and 
replicate  the  modules  when  legal  and  profitable.  Replication  means  executing  multiple  instances  of  modules  on 
different parts of the processor array, and is legal when none of the task-subroutines in the module has a self dependence,
or  carries  state  between  instances.  The  multiple  module  instances  share  work  by  processing  data-sets  in  a  round  robin 
fashion.  Our  procedure  is  as  follows: 

1.  Divide the processors among modules based on the best estimate of their execution rate.

2.  Map  the  modules  onto  the  processor  array,  replicating  whenever  it  is  possible  and  profitable. 

6.1  Allocating  processors  to  modules 

The  objective  of  processor  allocation  is  that  every  module  should  execute  in  approximately  the  same  amount  of  time, 
thus  minimizing  load  imbalance.  The  execution  time  of  a  module  containing  a  set  of  task-subroutines,  including  the 
communication  time,  on  P  processors,  can  be  modeled  as: 

$$\tau_M(P) = C_0^M + (C_1^M / P) + (C_2^M \cdot P) + \sum_k (C_k^{tp} \cdot P_k)$$

where each term in the summation refers to communication with another module. The parameters can be computed
directly from the parameters of the task-subroutines that constitute the module. The summation term involves the
number of processors allocated to other modules that this module communicates with. On a per processor basis, the best
performance is achieved when an instance of the module executes on the smallest number of processors. Thus, if each
instance of the module requires at least $P_0$ processors to execute, and the modules that it communicates with are also
assumed to execute on the smallest set of processors ($P_{k0}$), the execution time corresponding to the best per processor
performance is:

$$\tau_M(P_0) = C_0^M + (C_1^M / P_0) + (C_2^M \cdot P_0) + \sum_k (C_k^{tp} \cdot P_{k0})$$

which is a constant. The execution rate of module $M$ is:

$$X(M) = 1 / (\tau_M(P_0) \cdot P_0)$$

in datasets/second/processor.

Once  fixed  per  processor  execution  rates  are  assigned  to  modules,  the  problem  of  finding  an  optimal  allocation  can 
be  stated  as: 

Given a set of modules $M_1, M_2, \ldots, M_k$ with execution rate $X(M_i)$ for module $M_i$, divide the available processors $P$
into $k$ sets $P_1, P_2, \ldots, P_k$ such that $\max_{1 \le i \le k} (P_i \cdot X(M_i))$ is minimized.

We first solve the problem exactly, assuming the $P_i$ are allowed rational, i.e. non-integral, values. We assign an
arbitrary value to (say) $P_1$, then solve for $P_2, \ldots, P_k$ using $P_i \cdot X(M_i) = P_1 \cdot X(M_1)$, and scale all $P_i$ values such
that $\sum_i P_i = P$. The actual numbers of processors assigned to each module, of course, must be integers. Once a rational
solution is known, we simply find an approximate integer solution that is close to the rational solution.
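The two-stage allocation can be sketched as follows. The rational stage follows directly from requiring $P_i \cdot X(M_i)$ to be equal for all modules, so each share is proportional to $1/X(M_i)$. The integer stage is an assumption: the text only asks for an integer solution close to the rational one, and largest-remainder rounding is used here as one simple way to obtain it. Function names and the sample rates are hypothetical.

```python
from fractions import Fraction


def rational_allocation(rates, total):
    """Exact shares with P_i * X_i equal for all modules, summing to total.

    rates: per processor execution rates X(M_i); total: processors available.
    """
    inverse = [1 / Fraction(r) for r in rates]
    scale = Fraction(total) / sum(inverse)
    return [w * scale for w in inverse]


def integer_allocation(rates, total):
    """Round the exact shares to integers that still sum to total.

    Largest-remainder rounding is an assumed stand-in for the
    'approximate integer solution'; it ignores per module minimums.
    """
    exact = rational_allocation(rates, total)
    alloc = [int(x) for x in exact]          # floor, since shares are positive
    leftover = total - sum(alloc)
    by_remainder = sorted(range(len(exact)),
                          key=lambda i: exact[i] - alloc[i], reverse=True)
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return alloc
```

For rates `[1, 2, 4]` and 64 processors the exact shares are `256/7`, `128/7` and `64/7`, which round to `[37, 18, 9]`. A production version would also need to respect each module's minimum processor count, which this sketch does not do.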

As in the last section, the processor assignment procedure is based on finding fixed per processor execution
rates for modules. When this is not possible, an iterative approach is used.

6.2  Replication  and  processor  assignment 

As  stated  before,  replication  of  a  module  is  legal  if  none  of  the  task-subroutines  in  it  has  a  self  cycle.  Even  when 
replication  is  legal,  it  may  not  be  feasible  or  desirable  for  the  following  reasons. 

•  Enough  processors  may  not  be  available. 

•  When  the  communication  paradigm  requires  reservation  of  machine  resources  (e.g.  long  lived  connections  in 
some  parallel  machines  and  networks),  sufficient  resources  may  not  be  available. 

•  A large number of module instances increases the possibility of slowdown due to sharing of communication
resources.

We make the actual replication decisions heuristically, taking the above factors into account. We omit the details,
but typically the result is that the communication intensive routines that cannot use a large number of processors in one
instance effectively are more likely to be replicated, while subroutines that scale well are less likely to be replicated.

Finally, we have to assign the physical processors to module instances. This is done such that the sharing of communication resources between modules is minimized. In our current implementation, module instances can be mapped only to rectangular subarrays of processors.

It may be useful to re-execute the program with the predicted mapping, and collect information for fine tuning. We expect to be able to comment on the importance of this step with more experience.

7  Application  to  an  example  program 

We illustrate the process of automatic mapping by demonstrating how our toolset maps an example program onto a 64 processor iWarp array. The program consists of a 512 point 2D FFT, implemented as a sequence of 1D FFTs with a transpose between them, followed by a statistical analysis routine. The three task-subroutines corresponding to the computation stages are colffts, rowffts and hist (for column FFTs, row FFTs and histograms, respectively). Figure 5 shows the structure of the computation, and task-subroutine call level code for the program.


c$    begin parallel
      do i = 1,m
        call colffts(A)
c$        output: A
        call rowffts(A)
c$        input: A
c$        output: A
        call hist(A)
c$        input: A
      enddo
c$    end parallel

Figure  5:  Structure  of  the  example  program 

The first two task-subroutines are completely parallel, while the third task-subroutine, hist, contains significant communication. This example was chosen because it demonstrates the tradeoffs between different mapping styles and, more importantly, because it reflects the structure of a large class of applications in digital signal processing and image processing.

7.1  Obtaining  execution  parameters 

The program is executed for a set of different mappings, and the instrumentation generates timing information for the execution of the task-subroutines, and for communication between them. The mappings used for this example are shown in Figure 6. In accordance with Lemma 1, these 5 mappings are sufficient to derive the parameters of the program model presented in section 3. The measured execution and communication times are tabulated in Figure 7.

Analysis of these measurements using the methods discussed in section 4 yields the execution parameters of the program. The execution times of the task-subroutines in terms of the number of processors are as follows:

$T_1(P) = -0.97 + (0.011 \cdot P) + (600.4 / P)$
$T_2(P) = -6.42 + (0.072 \cdot P) + (639.0 / P)$
$T_3(P) = 15.27 + (0.231 \cdot P) + (252.2 / P)$
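
As an illustration of how coefficients of this form can be recovered from timings, the sketch below fits $a + bP + c/P$ exactly through three measurements. It is not necessarily the method of section 4. It assumes that the three measured execution times of the first task-subroutine in Figure 7 (49.18, 18.12 and 9.11) were taken on 12, 32 and 64 processors, and the function names are hypothetical.

```python
def fit_execution_model(samples):
    """Solve a + b*P + c/P = t for (a, b, c) from exactly three (P, t) samples.

    Uses Gauss-Jordan elimination with partial pivoting on the 3x3 system.
    """
    rows = [[1.0, float(p), 1.0 / p, float(t)] for p, t in samples]
    n = 3
    for col in range(n):
        # Move the row with the largest entry in this column into pivot position.
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    return tuple(rows[i][3] / rows[i][i] for i in range(n))


def predict(params, processors):
    """Evaluate the fitted model at a given processor count."""
    a, b, c = params
    return a + b * processors + c / processors


# Measured times of the first task-subroutine; processor counts are assumed.
t1_params = fit_execution_model([(12, 49.18), (32, 18.12), (64, 9.11)])
```

Under these assumptions the fitted coefficients come out close to, but not identical with, the reported $(-0.97, 0.011, 600.4)$, which is expected if the published values were derived from unrounded timings or from additional measurements.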

The estimated execution time function is plotted in Figure 8 and shows that the FFT stages scale very well, while the histogram stage does not. This reflects the computations in the task-subroutines: the first two are data parallel, while the third is communication intensive.

Task parallel mappings: (P1=12, P2=16, P3=32), (P1=16, P2=32, P3=12), (P1=32, P2=12, P3=16)
Data parallel mappings: P1, P2, P3 = 64 and P1, P2, P3 = 32

Figure 6: Mappings for analyzing the example program


Execution Time (10^-? secs.)

Processors    T1       T2       T3
12            49.18    47.67    39.06
32            18.12    15.82    30.54
64             9.11     8.16    34.00

Communication Time (10^-? secs.)

Processors                  Data Parallel        Task Parallel
(send-task -> recv-task)    T1->T2    T2->T3     T1->T2    T2->T3
32->32
64->64
12->16                                           9.57      9.61
16->32                                           9.65      9.63
32->12                                           9.80      9.95

Figure 7: Timings from evaluation runs of the example program


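The communication parameters, also derived from the measurements in Figure 7, are as follows:

$T^{dp}_{1 \to 2}(P) = 5.86 - (0.052 \cdot P)$
$T^{dp}_{2 \to 3}(P) = 0.10 - (0.001 \cdot P)$
$T^{tp}_{1 \to 2}(P_1, P_2) = 9.39 + (0.012 \cdot P_1) + (0.002 \cdot P_2)$
$T^{tp}_{2 \to 3}(P_2, P_3) = 9.46 + (0.016 \cdot P_2) - (0.003 \cdot P_3)$

where the superscripts dp and tp denote data parallel and task parallel communication, respectively.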

We note that there is some cost of data parallel communication between task-subroutines $T_1$ and $T_2$, but practically none between $T_2$ and $T_3$. This is because the former involves a matrix transpose, while the latter requires no data movement. The task parallel communication time is practically constant. In our implementation, the communication between tasks is done systolically on a single link with a fixed bandwidth, and hence the communication time primarily depends on the volume of data being transferred, which is the same in all cases.

Figure 8: Scaling of example program subroutines

We omit the details of the memory requirement analysis and state the final results. A processor in our target machine (iWarp) has a fixed amount of main memory. The minimum number of processors required to fit the data set of each of the three task-subroutines is 10; the number is the same for all three since the main data structure is the same array. However, when task-subroutine $T_1$ is mapped to the same module as $T_2$ or $T_3$, a minimum of 20 processors is required, since memory for both the original and the transposed array must be allocated.

7.2 Modules from tasks

We initially put each task-subroutine in a module by itself, and compare pairs of modules that have a communication edge between them to determine if the modules should be merged. In the example program, we have modules $M_1$, $M_2$ and $M_3$ containing the three task-subroutines. Using the priority system discussed in 5.3, we first examine the pair $(M_2, M_3)$, since $M_3$ contains the least scalable task-subroutine. Using Lemma 2, we compute:

$(P_2 T_2 + P_3 T_3) / (P_{23} T_{23}) = 1.189 > 1$

where $P_2, P_3 = 10$, $P_{23} = 10$, and execution times are computed from the equations for the corresponding modules. Since the above expression is greater than 1, we conclude that the modules should be combined, yielding a new module $M_{23}$. The execution parameters of $M_{23}$ are obtained by adding the execution parameters of $M_2$ and $M_3$ and the data parallel communication parameters for communication between them. We obtain:

$T_{23}(P) = 8.95 + (0.302 \cdot P) + (891.2 / P)$
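
The merge test can be reproduced numerically from the parameters of section 7.1. The sketch below is an illustration only: Lemma 2 is not restated here, so the code assumes that the ratio charges the task parallel communication time to both the sending and the receiving module. Under that assumption it reproduces the reported value of 1.189. All function and variable names are hypothetical.

```python
def exec_time(params, processors):
    """Execution model a + b*P + c/P with params = (a, b, c)."""
    a, b, c = params
    return a + b * processors + c / processors


# Execution parameters of the row FFT and histogram task-subroutines (section 7.1).
T2 = (-6.42, 0.072, 639.0)
T3 = (15.27, 0.231, 252.2)
# Data parallel communication between them: constant term and slope in P.
DP_COMM_23 = (0.10, -0.001)


def tp_comm_23(p2, p3):
    """Task parallel communication time between the two task-subroutines."""
    return 9.46 + 0.016 * p2 - 0.003 * p3


def merge_params(first, second, dp_comm):
    """Add two execution models and the data parallel communication between them."""
    return (
        first[0] + second[0] + dp_comm[0],
        first[1] + second[1] + dp_comm[1],
        first[2] + second[2],
    )


def merge_ratio(p2, p3, p23):
    """Processor-time of separate modules relative to the merged module.

    Assumption: each separate module pays the task parallel communication time.
    """
    comm = tp_comm_23(p2, p3)
    separate = p2 * (exec_time(T2, p2) + comm) + p3 * (exec_time(T3, p3) + comm)
    merged = p23 * exec_time(merge_params(T2, T3, DP_COMM_23), p23)
    return separate / merged


T23 = merge_params(T2, T3, DP_COMM_23)   # approximately (8.95, 0.302, 891.2)
ratio = merge_ratio(10, 10, 10)          # greater than 1, so the modules are merged
```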


As before, we test if the modules $M_1$ and $M_{23}$ should be merged and obtain:

$(P_1 T_1 + P_{23} T_{23}) / (P_{123} T_{123}) = 0.945 < 1$

and therefore the modules are not combined. Thus the final assignment of task-subroutines to modules is:

$M_1 = \{T_1\}$
$M_{23} = \{T_2, T_3\}$

7.3 Allocation of processors

The processors are allocated on the basis of the relative execution speed for the minimum number of processors that the modules run on, including the communication time. In this case, we have:

And for a 64 processor machine, we get

$P_1 = 24.9 \approx 25$
$P_{23} = 39.1 \approx 39$

7.4  Replication 

The replication decisions are based on the scalability of different modules: the less scalable are prioritized. We measure the execution time for running each module on all the processors allocated to it, and with only half of the processors allocated, and pick the module which shows the least speedup. For our example, we get:

$T_1(12.5) / T_1(25) = 2.02$
$T_{23}(19.5) / T_{23}(39) = 1.39$

Hence module $M_{23}$ is selected for replication. It has 39 processors allocated, and each instance needs at least 10, so we overallocate and create four instances, each executing on 10 processors, consuming 40 processors. The remaining 24 processors are allocated to module $M_1$. Since module $M_1$ shows nearly linear speedup, it is not replicated.
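
The selection step above can be checked with the execution models of the two modules. The sketch below is illustrative: the names are hypothetical, and choosing the instance count by rounding up (allocated processors divided by the per-instance minimum) is an assumption that happens to match the overallocation to four instances in this example. The actual decision is heuristic.

```python
import math


def exec_time(params, processors):
    """Execution model a + b*P + c/P with params = (a, b, c)."""
    a, b, c = params
    return a + b * processors + c / processors


# Execution parameters of the two modules (sections 7.1 and 7.2).
M1 = (-0.97, 0.011, 600.4)
M23 = (8.95, 0.302, 891.2)


def halving_speedup(params, processors):
    """Speedup from running on all allocated processors instead of half of them."""
    return exec_time(params, processors / 2) / exec_time(params, processors)


def pick_for_replication(modules):
    """Return the name of the module with the least speedup (the least scalable)."""
    return min(modules, key=lambda name: halving_speedup(*modules[name]))


def instance_count(allocated, minimum_per_instance):
    """Assumption: round up, overallocating if necessary."""
    return math.ceil(allocated / minimum_per_instance)


modules = {"M1": (M1, 25), "M23": (M23, 39)}
chosen = pick_for_replication(modules)        # "M23"
instances = instance_count(39, 10)            # 4 instances of 10 processors
remaining = 64 - instances * 10               # 24 processors left for M1
```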

Figure 9 summarizes the steps in mapping the example program, and shows the final mapping. Figure 10 compares the performance of this mapping to other mappings.

8  Discussion 

We are using the results of this research for developing several applications with the Fx compiler. We obtained a fourfold improvement over a fully data parallel version of the narrowband tracking radar benchmark [11], mainly because the data parallel version could use only a small number of cells effectively. Other applications under development include SAR (Synthetic Aperture Radar), Multibaseline Stereo, and MRI (Magnetic Resonance Imaging). While the discussion of these applications is beyond the scope of this paper, they all have multiple stages with different computation requirements, similar to the model program discussed in section 7.

The main limitation of our approach is that it can be used only for programs where the history of execution is a good indicator of the future computation and communication requirements. In the current implementation, the mapping is fixed during execution, and cannot change with runtime behavior. While our approach is not applicable to programs whose runtime behavior is completely unpredictable, it can be adapted for programs where the runtime behavior can change, but recent history is still a good predictor of the near future. We are planning to develop an implementation that supports dynamic remapping based on changes in runtime behavior. While this will add considerable complexity to the system, the fundamental approach remains the same.

We have selected a fairly simple execution model and many refinements are possible. We model execution times as quadratic functions and communication times as linear functions. We also ignore several secondary effects, for instance, the effect on performance of sharing global machine resources. Our guiding principle has been to develop a simple model and implementation that can be used effectively and conveniently for developing applications. We expect experience to guide us in refining the automatic mapping tool.

Plans for future research include developing an interactive tool to guide program mapping, and porting the implementation to other parallel computers and high speed networks. We are also in the process of identifying more applications which can profit from an integrated task and data parallel compiler.

Figure 9: Mapping steps and the final mapping of the example program

Figure 10: Relative performance of the final mapping


9  Related  Work 

Compilation and optimization of programs for private memory parallel computers has been a very active area of research for several years. Several parallelizing compilers have been developed for data parallel programs, including Fortran D [13] and Vienna Fortran [3], and for task parallel programs [8, 9]. Recent research shows that a large class of applications contain task and data parallelism [6] and it is important to exploit them in a single compiler framework [4, 5, 12]. There is also a large body of literature on partitioning, load balancing and scheduling of parallel programs [1, 10].

We have addressed the specific partitioning and load balancing issues that arise when task and data parallelism are combined in a parallelizing compiler, including memory requirement issues that are important but often ignored, and developed a practical system to efficiently compile and map task and data parallel programs. An alternate approach, taken in Jade [8], is to express all parallelism as coarse grain tasks, and make scheduling decisions at runtime. This is particularly useful when runtime behavior is unpredictable, but may entail a higher overhead. Most applications have components that have simple data parallelism, and we believe that it is extremely important to use an optimizing data parallel compiler for integrated task and data parallel systems, and to statically schedule and optimize computations and communication whenever feasible.


10  Conclusions 


We have examined the fundamental characteristics of task and data parallel programs that determine their performance, and presented an execution model for such applications. We used this model and analysis techniques to build a tool to automatically map parallel programs onto a parallel machine. We are using this tool to develop a class of applications that need to exploit task and data parallelism for efficient execution.